Abstract - For a stationary stochastic process $\{X_n\}$ with values
in some set $A$, a finite word $w \in A^K$ is called a memory word
if the conditional probability of $X_0$ given the past is constant on
the cylinder set defined by $X_{-K}^{-1} = w$. It is called a minimal
memory word if no proper suffix of $w$ is also a memory word.
For example, in a $K$-step Markov process all words of length
$K$ are memory words, but not necessarily minimal. We consider
the problem of determining the lengths of the longest minimal
memory words and the shortest memory words of an unknown
process $\{X_n\}$ based on sequentially observing the outputs of a
single sample $\{\xi_1, \xi_2, \dots, \xi_n\}$. We will give a universal estimator
which converges almost surely to the length of the longest minimal
memory word and show that no such universal estimator exists
for the length of the shortest memory word. The alphabet $A$ may
be finite or countable.

Index Terms — Markov chains, order estimation, probability, 
statistics, stationary processes, stochastic processes 



I. Introduction 

For a stationary stochastic process $\{X_n\}$ with values in
some set $A$, a finite word $w \in A^K$ is called a memory word if
the conditional probability of $X_0$ given the past is constant
on the cylinder set defined by $X_{-K}^{-1} = w$. We are using
here the customary notation $X_i^j = (X_i, X_{i+1}, \dots, X_j)$. For
example, in a $K$-step Markov process all words of length
$K$ are memory words. However, in general $K$-step Markov
processes may also have short memory words, cf. Bühlmann
and Wyner [2]. Naturally any left extension of a memory
word is also a memory word and it is natural to consider
the minimal memory words, namely those none of whose
proper suffixes are memory words. We consider the problem of
determining the lengths of the longest minimal memory words
and of the shortest memory words of an unknown process
$\{X_n\}$ based on sequentially observing the outputs of a single
sample $\{\xi_1, \xi_2, \dots, \xi_n\}$. That is to say we would like to have
sequences of functions $L_n, S_n$ so that $L_n(\xi_1, \xi_2, \dots, \xi_n)$ will
converge almost surely to $K$ in case the process is $K$-step
Markov (but not $(K-1)$-step Markov), and to infinity
otherwise, while $S_n(\xi_1, \xi_2, \dots, \xi_n)$ will converge almost surely
to the length of the shortest memory word in the process.

Most previous work of this kind (see for example Csiszár
and Shields, Csiszár, and Ryabko et al. [18]) was
restricted to finite state processes. Our estimators will allow
for countable alphabets and this precludes the use of a priori
exponential estimates which can be used in the class of finite
state $K$-step Markov chains. In the next section we will give
a universal estimator which converges almost surely to the
length of the longest minimal memory word, i.e. the minimal
order of the process. This will be a finite number in case the
process is a finite step Markov chain and infinity otherwise.
This is somewhat simpler than the estimators that we gave
in [14]. On the other hand, we will show in the last section
that no sequence like $S_n$ which converges to the length of the
shortest memory word can exist. In addition we will see how
this gives a brief proof of another negative result of ours from
[15] concerning estimators that are only defined along some
sequence of stopping times.

II. Estimating the Length of the Longest Minimal 
Memory Word of a Process 

Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking
values from a discrete (finite or countably infinite) alphabet $\mathcal{X}$.
(Note that all stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought to
be a two sided time series, that is, $\{X_n\}_{n=-\infty}^{\infty}$.) For notational
convenience, let $X_m^n = (X_m, \dots, X_n)$, where $m \le n$. Note
that if $m > n$ then $X_m^n$ is the empty string.
Let $p(x_{-k}^0)$ and $p(y|x_{-k}^0)$ denote the distribution
$P(X_{-k}^0 = x_{-k}^0)$ and the conditional distribution
$P(X_1 = y | X_{-k}^0 = x_{-k}^0)$, respectively.

Definition 2.1: We say that $w_{-k+1}^0$ is a memory word if
$p(w_{-k+1}^0) > 0$ and for all $i \ge 1$, all $y \in \mathcal{X}$, all
$z_{-k-i+1}^{-k} \in \mathcal{X}^i$,

$$p(y|w_{-k+1}^0) = p(y|z_{-k-i+1}^{-k}, w_{-k+1}^0)$$

provided $p(z_{-k-i+1}^{-k}, w_{-k+1}^0, y) > 0$. If no proper suffix of $w$
is a memory word then $w$ is called a minimal memory word.

Define the set $\mathcal{W}_k$ of those memory words $w_{-k+1}^0$ with length
$k$, that is,

$$\mathcal{W}_k = \{w_{-k+1}^0 \in \mathcal{X}^k : w_{-k+1}^0 \text{ is a memory word}\}.$$



A discrete alphabet stationary time series is said to be a
Markov chain if for some finite $K \ge 0$, $P(X_{-K+1}^0 \in \mathcal{W}_K) = 1$
and the smallest such $K$ is called the order of the Markov
chain. For Markov chains the order is the length of the longest
minimal memory word.
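
To make these definitions concrete, the following minimal Python sketch computes, for a finite alphabet second order Markov chain given by its transition kernel, the length of the minimal memory word that ends each admissible two-letter past. The representation `kernel[(a, b)][y]` for $p(y|a,b)$, the function name `minimal_memory_lengths`, the example kernel, and the assumption that every listed pair has positive stationary probability are illustrative choices and do not come from the paper.

```python
def minimal_memory_lengths(kernel):
    """Return {(a, b): length of the minimal memory word that is a suffix of (a, b)}.

    kernel[(a, b)] is a dict {y: p(y | a, b)} for a second order chain, where
    a is the older and b the more recent symbol. Assumptions (hypothetical
    setup): every listed pair has positive stationary probability and all
    rows list the same keys, so rows can be compared with ==.
    """
    rows = list(kernel.values())
    if all(row == rows[0] for row in rows):
        # The next symbol never depends on the past: the empty word suffices.
        return {pair: 0 for pair in kernel}
    lengths = {}
    for (a, b), row in kernel.items():
        # b alone is a memory word iff every admissible left extension (a', b)
        # gives the same conditional distribution for the next symbol.
        same = all(other == row for (_, b2), other in kernel.items() if b2 == b)
        lengths[(a, b)] = 1 if same else 2
    return lengths


example_kernel = {
    (0, 0): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5},  # after a 0, the older symbol is irrelevant
    (0, 1): {0: 0.9, 1: 0.1},
    (1, 1): {0: 0.2, 1: 0.8},  # after a 1, the older symbol matters
}
lengths = minimal_memory_lengths(example_kernel)
order = max(lengths.values())     # length of the longest minimal memory word
shortest = min(lengths.values())  # length of the shortest memory word
```

In this example the word "0" is a memory word of length one, while the chain still has order two because the pasts ending in "1" need both letters. These are the two quantities whose estimation is studied in Sections II and III.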

In general we can define a function K which will give us 
the length of the minimal memory word of a sequence of past 
observations. 

Definition 2.2: For a stationary time series $\{X_n\}$ the (random)
length $K(X_{-\infty}^0)$ of the memory of the sample path
$X_{-\infty}^0$ is the smallest possible $0 \le K < \infty$ such that for
all $i \ge 1$, all $y \in \mathcal{X}$, all $z_{-K-i+1}^{-K} \in \mathcal{X}^i$,

$$p(y|X_{-K+1}^0) = p(y|z_{-K-i+1}^{-K}, X_{-K+1}^0)$$

provided $p(z_{-K-i+1}^{-K}, X_{-K+1}^0, y) > 0$, and $K(X_{-\infty}^0) = \infty$ if
there is no such $K$.
Our goal in this section is to estimate the essential supremum
of this function. This is a simple numerical function
which depends on the process and not on any particular
realization of the process. In contrast, in [13] we addressed the
problem of estimating the minimal length of a memory word
which occurs as the suffix of the first $n$ observations of the
process. This varies of course with $n$ and with the realization.

For $k \ge 0$ let $\mathcal{S}_k$ denote the support of the distribution of
$X_{-k}^0$:

$$\mathcal{S}_k = \{x_{-k}^0 \in \mathcal{X}^{k+1} : p(x_{-k}^0) > 0\}.$$

Define

$$\Delta_k = \sup_{1 \le i} \; \sup_{(z_{-k-i+1}^0, x) \in \mathcal{S}_{k+i}} \left| p(x|z_{-k+1}^0) - p(x|z_{-k-i+1}^0) \right|.$$

If for some $k$, $\Delta_k = 0$ then the process is a Markov chain and
the least such $k$ is the order of the chain. We need to define
a statistic to estimate $\Delta_k$. To this end let

$$\hat p_n(x|z_{-k+1}^0) = \frac{\left(\#\{k-1 \le t \le n-1 : X_{t-k+1}^{t+1} = (z_{-k+1}^0, x)\} - 1\right)^+}{\left(\#\{k-1 \le t \le n-1 : X_{t-k+1}^{t} = z_{-k+1}^0\} - 1\right)^+}$$

where $0/0$ is defined as $0$. We subtract one for technical
reasons; this does not affect the properties we need here.
These empirical distributions, as well as the sets we are
about to introduce, are functions of $X_0^n$, but we suppress this
dependence to keep the notation manageable.
For a fixed $0 < \gamma < 1$ let $\mathcal{S}_k^n$ denote the set of strings with
length $k+1$ which appear more than $n^{1-\gamma}$ times in $X_0^n$. That
is,

$$\mathcal{S}_k^n = \{x_{-k}^0 \in \mathcal{X}^{k+1} : \#\{k \le t \le n : X_{t-k}^t = x_{-k}^0\} > n^{1-\gamma}\}.$$
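
The empirical quantities of this section can be sketched in a few lines of Python. The block below computes $\hat p_n$ and $\mathcal{S}_k^n$ as defined above and, using the definitions given in the remainder of this section, the empirical $\Delta_k^n$ and the order estimate $\chi_n$. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the default values `gamma=0.5` and `beta=0.2`, and the reading of the empty-set convention as "a maximum over no strings equals zero" are all choices made here for illustration.

```python
from collections import Counter
from functools import lru_cache


def make_counts(xs):
    """Return counts(length, drop_last): block counts in the sample X_0..X_n."""
    xs = tuple(xs)

    @lru_cache(maxsize=None)
    def counts(length, drop_last):
        # drop_last=True ignores the block ending at X_n (no successor observed).
        stop = len(xs) - length + 1 - (1 if drop_last else 0)
        return Counter(xs[s:s + length] for s in range(max(stop, 0)))

    return counts


def p_hat(counts, z, x):
    """Empirical p_n(x | z): both counts reduced by one, 0/0 read as 0."""
    num = max(counts(len(z) + 1, False)[z + (x,)] - 1, 0)
    den = max(counts(len(z), True)[z] - 1, 0)
    return num / den if den > 0 else 0.0


def frequent(counts, n, k, gamma):
    """S^n_k: strings of length k + 1 seen more than n^(1 - gamma) times."""
    threshold = n ** (1.0 - gamma)
    return [w for w, c in counts(k + 1, False).items() if c > threshold]


def delta_hat(counts, n, k, gamma):
    """Empirical Delta^n_k; a maximum over no strings is taken to be 0."""
    best = 0.0
    for i in range(1, n + 1):
        words = frequent(counts, n, k + i, gamma)
        if not words:
            break  # a longer string cannot be more frequent than its prefix
        for w in words:
            z_long, x = w[:-1], w[-1]
            z_short = z_long[i:]  # the last k symbols of the context
            gap = abs(p_hat(counts, z_short, x) - p_hat(counts, z_long, x))
            best = max(best, gap)
    return best


def chi_hat(xs, gamma=0.5, beta=0.2):
    """Smallest k < n with Delta^n_k <= n^(-beta), and n if there is none."""
    assert 0 < gamma < 1 and 0 < beta < (1 - gamma) / 2
    n = len(xs) - 1
    if n < 1:
        return 0
    counts = make_counts(xs)
    for k in range(n):
        if delta_hat(counts, n, k, gamma) <= n ** (-beta):
            return k
    return n
```

For instance, `chi_hat([0, 1] * 50)` evaluates to 1 for the alternating sample, since the next symbol is determined by the last one, and `chi_hat([0] * 100)` evaluates to 0 for a constant sample. The sketch recounts blocks for every length, which is adequate for short samples only.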



The strings in $\mathcal{S}_k^n$ are those which occur sufficiently often so
that we can rely on their empirical distribution.

Finally, define the empirical version of $\Delta_k$ as follows:

$$\Delta_k^n = \max_{1 \le i \le n} \; \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}_{k+i}^n} \left| \hat p_n(x|z_{-k+1}^0) - \hat p_n(x|z_{-k-i+1}^0) \right|.$$

Let us agree by convention that if the smallest of the sets over
which we are maximizing is empty then $\Delta_k^n = 0$.

Observe that by ergodicity, for any fixed $k$,

$$\liminf_{n \to \infty} \Delta_k^n \ge \Delta_k \quad \text{almost surely.} \tag{1}$$

We define an estimate $\chi_n$ for the order from samples $X_0^n$ as
follows. Let $0 < \beta < \frac{1-\gamma}{2}$ be arbitrary. Set $\chi_0 = 0$, and
for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k_n < n$ such that
$\Delta_{k_n}^n \le n^{-\beta}$ if there is such a $k_n$, and $n$ otherwise.

Theorem 2.1: For any ergodic, stationary process $\{X_n\}$ the
sequence of estimators $\chi_n$ converges almost surely to the
essential supremum of the memory function $K$.

Proof:

If the process is a Markov chain, it is immediate that for all
$k$ greater than or equal to the order, $\Delta_k = 0$. For $k$ less than the
order $\Delta_k > 0$. If the process is not a Markov chain with any
finite order then $\Delta_k > 0$ for all $k$. Thus by (1) if the process
is not Markov then $\chi_n \to \infty$ and if it is Markov then $\chi_n$ is
greater than or equal to the order eventually almost surely. We
have to show that if the process is a Markov chain of order
$K$ then $\chi_n$ is eventually almost surely at most $K$.
Let us suppose that the process is indeed a Markov chain with
order $K$. Recall the simple fact that the letters that follow
the successive occurrences of a word $w$ with length $K$ are
independent and identically distributed random variables (cf.
Lemma 1 in Morvai and Weiss [15]). Since the alphabet may
be infinite we can't take into consideration all possible words
in our estimation of the undesirable event $(\chi_n > K)$. Instead
we restrict attention to the words that actually occur and so
we fix a location $(l-K, l]$ in the index set and then fix a
word $w_{-K+1}^0$ that occurs there together with a particular state
$x$ that follows it. The random times $l + \lambda^+$ and $l - \lambda^-$ are the
other occurrences of this memory word in the process. Here
is the formal definition. Set $\lambda^+_{l,K,0} = 0$, $\lambda^-_{l,K,0} = 0$
and define

$$\lambda^+_{l,K,i} = \lambda^+_{l,K,i-1} + \min\{t > 0 : X^{l+\lambda^+_{l,K,i-1}+t}_{l+\lambda^+_{l,K,i-1}-K+1+t} = X^{l}_{l-K+1}\}$$

and

$$\lambda^-_{l,K,i} = \lambda^-_{l,K,i-1} + \min\{t > 0 : X^{l-\lambda^-_{l,K,i-1}-t}_{l-\lambda^-_{l,K,i-1}-K+1-t} = X^{l}_{l-K+1}\}.$$

Assume $w_{-K+1}^0$ is any word and $x$ is a letter. Then by Lemma
1 in Morvai and Weiss [15] for $i, j \ge 1$,

$$X_{l-\lambda^-_{l,K,i}+1}, \dots, X_{l-\lambda^-_{l,K,1}+1}, X_{l+\lambda^+_{l,K,1}+1}, \dots, X_{l+\lambda^+_{l,K,j}+1}$$

are conditionally independent and identically distributed random
variables given $X_{l-K+1}^{l} = w_{-K+1}^0$, $X_{l+1} = x$, where
the identical distribution is $p(\cdot|w_{-K+1}^0)$.

Letting $n > K$ we proceed to estimate the probability of
the undesirable event $(\chi_n > K)$.

Observe that by our definition of $\chi_n$ we have

$$P(\chi_n > K) \le P(\Delta_K^n > n^{-\beta}) \le \sum_{i=1}^{n} P\Big( \max_{(z_{-K-i+1}^0, x) \in \mathcal{S}_{K+i}^n} \left| \hat p_n(x|z_{-K+1}^0) - \hat p_n(x|z_{-K-i+1}^0) \right| > n^{-\beta} \Big)$$


where the second inequality follows from our assumption
on the order of the chain.
The last term satisfies:

$$\sum_{i=1}^{n} P\Big( \max_{(z_{-K-i+1}^0, x) \in \mathcal{S}_{K+i}^n} \left| \hat p_n(x|z_{-K+1}^0) - \hat p_n(x|z_{-K-i+1}^0) \right| > n^{-\beta} \Big)$$
$$\le \sum_{i=1}^{n} P\Big( \max_{(z_{-K-i+1}^0, x) \in \mathcal{S}_{K+i}^n} \left| \hat p_n(x|z_{-K+1}^0) - p(x|z_{-K+1}^0) \right| > 0.5 n^{-\beta} \Big)$$
$$+ \sum_{i=1}^{n} P\Big( \max_{(z_{-K-i+1}^0, x) \in \mathcal{S}_{K+i}^n} \left| p(x|z_{-K-i+1}^{-K}, z_{-K+1}^0) - \hat p_n(x|z_{-K-i+1}^0) \right| > 0.5 n^{-\beta} \Big). \tag{2}$$

We continue with the estimation of the last two summands.
For a given $K-1 \le l \le n-1$ assume that
$X_{l-K+1}^{l+1} = w_{-K+1}^0 x$. Lemma 4.1 (Hoeffding's inequality)
in the Appendix for sums of bounded independent random
variables implies

$$P\Bigg( \Bigg| \frac{\sum_{h=1}^{i} 1_{\{X_{l-\lambda^-_{l,K,h}+1} = x\}} + \sum_{h=1}^{j} 1_{\{X_{l+\lambda^+_{l,K,h}+1} = x\}}}{i+j} - p(x|w_{-K+1}^0) \Bigg| > 0.5 n^{-\beta} \;\Bigg|\; X_{l-K+1}^{l+1} = w_{-K+1}^0 x \Bigg) \le 2 e^{-0.5 n^{-2\beta}(i+j)}.$$

Multiplying both sides by $P(X_{l-K+1}^{l+1} = w_{-K+1}^0 x)$ and
summing over all possible words $w_{-K+1}^0$ and $x$ we get that

$$P\Bigg( X_{l-K+1}^{l+1} \in \mathcal{S}_{K}^n, \; \Bigg| \frac{\sum_{h=1}^{i} 1_{\{X_{l-\lambda^-_{l,K,h}+1} = X_{l+1}\}} + \sum_{h=1}^{j} 1_{\{X_{l+\lambda^+_{l,K,h}+1} = X_{l+1}\}}}{i+j} - p(X_{l+1}|X_{l-K+1}^{l}) \Bigg| > 0.5 n^{-\beta} \Bigg) \le 2 e^{-0.5 n^{-2\beta}(i+j)}.$$

Summing over all $K-1 \le l \le n-1$ and over all pairs $(i, j)$
such that $i \ge 0$, $j \ge 0$, $i + j \ge \lfloor n^{1-\gamma} \rfloor$ we get that

$$P\Big( \text{for some } K-1 \le l \le n-1 : \; X_{l-K+1}^{l+1} \in \mathcal{S}_{K}^n, \; \left| \hat p_n(X_{l+1}|X_{l-K+1}^{l}) - p(X_{l+1}|X_{l-K+1}^{l}) \right| > 0.5 n^{-\beta} \Big) \le n^2 \sum_{h=\lfloor n^{1-\gamma} \rfloor}^{\infty} h \, 2 e^{-0.5 n^{-2\beta} h}.$$

Applying this final inequality to each of the terms in (2) we
get that

$$P(\chi_n > K) \le 4 n^3 \sum_{h=\lfloor n^{1-\gamma} \rfloor}^{\infty} h \, e^{-0.5 n^{-2\beta} h}.$$

The right hand side is summable given $0 < \beta < \frac{1-\gamma}{2}$ and then
by the Borel-Cantelli lemma a.s. the undesirable event occurs
only finitely many times and thus the proof of Theorem 2.1
is complete.

Remark. Bailey [1] showed that one can not discriminate
between processes where the supremum of the lengths of the
minimal memory words is finite or infinite. In other words,
there is no sequence of functions which will converge to Yes
in case the observed process is Markov with some finite but
unknown order and to No otherwise. The result we have just
established does not contradict this since our estimators give
numbers rather than just two values. There is a more detailed
discussion of this phenomenon in [12].

III. Limitations on Estimating the Length of the
Shortest Memory Word of a Second Order
Markov Chain

The next theorem shows that even when we restrict attention
to second order Markov chains there is no universal estimator
for the length of the shortest memory word.

Theorem 3.1: Let $\mathcal{X} = \{0, 1, 2, \dots\}$. For any estimator
$\{h_n(X_0, \dots, X_n)\}$ such that for all stationary and ergodic
second order Markov chains taking values from $\mathcal{X}$ with
minimum length of memory words being equal to one

$$\limsup_{n \to \infty} P(h_n(X_0, \dots, X_n) = 1) = 1$$

there exists a stationary and ergodic second order Markov
chain taking values from $\mathcal{X}$ with minimum length of memory
words being equal to two such that

$$\limsup_{n \to \infty} P(h_n(X_0, \dots, X_n) = 1) = 1.$$


Proof: As is customary in proofs of this type of theorem we
construct the problematic Markov chain by a sequence of steps
in which at intermediate stages we will have a second order
Markov chain with some memory words of length one. The
point is that these memory words occur very infrequently so
that a very small modification of the process suffices to destroy
one of these while preserving the others. This modification is
small enough so as not to change some finite distributions by
too much so that all of these features will be present in the
limiting process. To keep the technical details to a minimum
we start with one Markov chain and all of our modifications
are functions of the initial chain. Here are the details.

First we define a Markov chain (cf. Ryabko [17]) which
serves as the technical tool for construction of our counterexample.
Let the state space $S$ be the non-negative integers.
From state $0$ the process certainly passes to state $1$ and then
to state $2$, at the following epoch. From each state $s \ge 2$, the
Markov chain passes either to state $0$ or to state $s+1$ with
equal probabilities $0.5$. This construction yields a stationary
and ergodic Markov chain $\{M_i\}$ with stationary distribution

$$P(M = 0) = P(M = 1) = \frac{1}{4}$$

and

$$P(M = i) = \frac{1}{2^i} \quad \text{for } i \ge 2.$$
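
The stated stationary distribution can be checked against the balance equations of the chain. The following Python sketch does this in exact rational arithmetic on the states $0, \dots, S$; the function names and the truncation level `S = 30` are illustrative assumptions made here, not part of the paper.

```python
from fractions import Fraction


def pi(s):
    """Claimed stationary probability of state s: 1/4 for s < 2, else 2^(-s)."""
    return Fraction(1, 4) if s < 2 else Fraction(1, 2 ** s)


def transitions(s):
    """One-step transition probabilities out of state s."""
    if s == 0:
        return {1: Fraction(1)}
    if s == 1:
        return {2: Fraction(1)}
    return {0: Fraction(1, 2), s + 1: Fraction(1, 2)}


def balance_residuals(S):
    """pi(t) minus the mass flowing into t from states 0..S, for t = 0..S."""
    inflow = {t: Fraction(0) for t in range(S + 1)}
    for s in range(S + 1):
        for t, p in transitions(s).items():
            if t <= S:
                inflow[t] += pi(s) * p
    return {t: pi(t) - inflow[t] for t in range(S + 1)}


residuals = balance_residuals(30)
```

Every state $t \ge 1$ has residual exactly zero. State $0$ is left with the residual $2^{-(S+1)}$, which is precisely the mass that would flow in from the truncated states $s > S$, so the balance equations hold in the limit.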



Now let $f^{(0)}(0) = f^{(0)}(1) = 0$ and for all $s \ge 2$ let
$f^{(0)}(s) = s$. The process $\{X_i^{(0)} = f^{(0)}(M_i)\}$ is a stationary
ergodic countable alphabet second order Markov-chain with
minimum length of the memory words being equal to one.
Set $N_0 = 1$. Let $n_0 > N_0$ be so large that

$$P\big( h_{n_0}(X_0^{(0)}, \dots, X_{n_0}^{(0)}) = 1 \,\big|\, X_0^{(0)} = X_1^{(0)} = 0 \big) > 1 - \frac{1}{2}.$$

Now, observe that

• for $N_0 < s$: $f^{(0)}(s) = s$ and $s$ is a memory word,
• $\{f^{(0)}(s) : s \le N_0\} \cap \{f^{(0)}(s) : N_0 < s\} = \emptyset$,
• for $0 \le s \le N_0$: $f^{(0)}(s)$ is not a memory word.

Now we define the function $f^{(1)}$. For all $0 \le s \le n_0$ and
for all $2n_0 - N_0 + 1 \le s$ define

$$f^{(1)}(s) = f^{(0)}(s)$$

and for all $n_0 < s \le n_0 + (n_0 - N_0)$ let
$f^{(1)}(s) = n_0 - (s - n_0) + 1$. The resulting process
$\{X_i^{(1)} = f^{(1)}(M_i)\}$ is a stationary ergodic countable alphabet
second order Markov-chain with minimum length of the memory
words being equal to one and

$$P\big( h_{n_0}(X_0^{(1)}, \dots, X_{n_0}^{(1)}) = 1 \,\big|\, X_0^{(1)} = X_1^{(1)} = 0 \big) > 1 - \frac{1}{2}.$$

Put $N_1 = 2n_0 - N_0$ and let $n_1 > N_1 + 1$ be so large that

$$P\big( h_{n_1 - 1}(X_0^{(1)}, \dots, X_{n_1 - 1}^{(1)}) = 1 \,\big|\, X_i^{(1)} = X_{i+1}^{(1)} = 0 \text{ for some } -1 \le i \le 0 \big) > 1 - \Big(\frac{1}{2}\Big)^{2}.$$

Indeed, there exists such an $n_1$ since by assumption,

$$\limsup_{n \to \infty} P\big( h_n(X_0^{(1)}, \dots, X_n^{(1)}) = 1 \big) = 1.$$

Now, observe that

• for $N_1 < s$: $f^{(1)}(s) = s$ and $s$ is a memory word,
• $\{f^{(1)}(s) : s \le N_1\} \cap \{f^{(1)}(s) : N_1 < s\} = \emptyset$,
• for $0 \le s \le N_1$: $f^{(1)}(s)$ is not a memory word.

Now we define the function $f^{(j)}$ inductively. Assume we
have already defined positive integers $N_k$, $n_k > N_k + k$ and
functions $f^{(k)}$ for $0 \le k \le j-1$ with the following properties:

• each process $X_n^{(k)} = f^{(k)}(M_n)$ is a stationary and
ergodic countable alphabet second order Markov chain
with minimum length of the memory words being equal
to one,
• for $0 \le s \le n_k$, $f^{(j-1)}(s) = f^{(k)}(s)$,
• $P\big( h_{n_k - k}(X_0^{(j-1)}, \dots, X_{n_k - k}^{(j-1)}) = 1 \,\big|\, X_i^{(j-1)} = X_{i+1}^{(j-1)} = 0 \text{ for some } -k \le i \le 0 \big) > 1 - \big(\frac{1}{2}\big)^{k+1}$,
• for $N_{j-1} < s$: $f^{(j-1)}(s) = s$ and $s$ is a memory word,
• $\{f^{(k)}(s) : s \le N_k\} \cap \{f^{(k)}(s) : N_k < s\} = \emptyset$,
• for $0 \le s \le N_{j-1}$: $f^{(j-1)}(s)$ is not a memory word.

Now we define the function $f^{(j)}$. For all $0 \le s \le n_{j-1}$ and
for all $2n_{j-1} - N_{j-1} + 1 \le s$ define

$$f^{(j)}(s) = f^{(j-1)}(s)$$

and for all $n_{j-1} < s \le n_{j-1} + (n_{j-1} - N_{j-1})$ let
$f^{(j)}(s) = n_{j-1} - (s - n_{j-1}) + 1$. The resulting process
$\{X_i^{(j)} = f^{(j)}(M_i)\}$ is a stationary ergodic countable alphabet
second order Markov-chain with minimum length of the memory
words being equal to one and for all $0 \le k < j$

$$P\big( h_{n_k - k}(X_0^{(j)}, \dots, X_{n_k - k}^{(j)}) = 1 \,\big|\, X_i^{(j)} = X_{i+1}^{(j)} = 0 \text{ for some } -k \le i \le 0 \big) > 1 - \Big(\frac{1}{2}\Big)^{k+1}.$$

Put $N_j = 2n_{j-1} - N_{j-1}$ and let $n_j > N_j + j$ be so large
that

$$P\big( h_{n_j - j}(X_0^{(j)}, \dots, X_{n_j - j}^{(j)}) = 1 \,\big|\, X_i^{(j)} = X_{i+1}^{(j)} = 0 \text{ for some } -j \le i \le 0 \big) > 1 - \Big(\frac{1}{2}\Big)^{j+1}.$$

For the function $f^{(j)}$ just defined,

• the process $f^{(j)}(M_n)$ is a stationary and ergodic countable
alphabet second order Markov chain with minimum
length of the memory words being equal to one,
• for $0 \le k < j$ and $0 \le s \le n_k$: $f^{(j)}(s) = f^{(k)}(s)$,
• for $0 \le k \le j$:
$P\big( h_{n_k - k}(X_0^{(j)}, \dots, X_{n_k - k}^{(j)}) = 1 \,\big|\, X_i^{(j)} = X_{i+1}^{(j)} = 0 \text{ for some } -k \le i \le 0 \big) > 1 - \big(\frac{1}{2}\big)^{k+1}$,
• for $N_j < s$: $f^{(j)}(s) = s$ and $s$ is a memory word,
• $\{f^{(j)}(s) : s \le N_j\} \cap \{f^{(j)}(s) : N_j < s\} = \emptyset$,
• for $0 \le s \le N_j$: $f^{(j)}(s)$ is not a memory word.

Eventually, we defined a function $f(s) = \lim_{k \to \infty} f^{(k)}(s)$ and a
process $X_n = f(M_n)$ which is a stationary ergodic countable
alphabet second order Markov-chain with minimum length of
the memory words being equal to two and for all $0 \le k$

$$P\big( h_{n_k - k}(X_0, \dots, X_{n_k - k}) = 1 \,\big|\, X_i = X_{i+1} = 0 \text{ for some } -k \le i \le 0 \big) > 1 - \Big(\frac{1}{2}\Big)^{k+1}.$$

Thus

$$P\big( h_{n_k - k}(X_0, \dots, X_{n_k - k}) = 1 \big) \ge \Big( 1 - \Big(\frac{1}{2}\Big)^{k+1} \Big) P\big( X_i = X_{i+1} = 0 \text{ for some } -k \le i \le 0 \big).$$

Since

$$\lim_{k \to \infty} P\big( X_i = X_{i+1} = 0 \text{ for some } -k \le i \le 0 \big) = 1$$

we get that

$$\limsup_{n \to \infty} P(h_n(X_0, \dots, X_n) = 1) = 1.$$

The proof of Theorem 3.1 is complete.

The following theorem has been proved in Morvai and
Weiss [15] (Theorem 6). Here we give a simpler proof of
it based on Theorem 3.1.
Theorem 3.2: Let $\mathcal{X} = \{0, 1, 2, \dots\}$. There is no strictly
increasing sequence of stopping times $\{\lambda_n\}$ and estimators
$\{h_n(X_0, \dots, X_{\lambda_n})\}$ taking the values one and two, such that
for all second order Markov chains taking values from $\mathcal{X}$:

$$\lim_{n \to \infty} \frac{\lambda_n}{n} = 1$$

and

$$\lim_{n \to \infty} \left| h_n(X_0, \dots, X_{\lambda_n}) - K(X_{-\infty}^{\lambda_n}) \right| = 0 \quad \text{with probability one.}$$

Proof:

We argue by contradiction. Assume that Theorem 3.2 does
not hold. Then define

$$l_n(X_0, \dots, X_n) = \min_{0.5n < \lambda_k \le n} h_k(X_0, \dots, X_{\lambda_k}).$$

Now, by assumption $l_n(X_0, \dots, X_n)$ would be a pointwise
consistent estimate for the length of the shortest memory word
which contradicts Theorem 3.1. The proof of Theorem 3.2 is
complete.

Remark. For a positive result using stopping times cf.
Theorem 4 of [15]. It shows that for any positive $\epsilon$ there is a
sequence of stopping times $\lambda_n$ which with probability one will
have density at least $1 - \epsilon$ and along which we can successfully
estimate $K(X_{-\infty}^{\lambda_n})$. For more on the use of stopping times in
universal estimation see the recent survey [16] and [7], among
others.
IV. Appendix

The next lemma is due to Hoeffding, cf. [6].

Lemma 4.1: (Hoeffding's inequality, Hoeffding 1963) Let
$X_1, X_2, \dots, X_n$ be independent real valued random variables,
and $a_1, b_1, \dots, a_n, b_n$ be real numbers such that $a_i \le X_i \le b_i$
with probability one for all $1 \le i \le n$. Then, for all $\epsilon > 0$,

$$P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} (X_i - E X_i) \Big| > \epsilon \Big) \le 2 e^{-2 n \epsilon^2 / \left( \frac{1}{n} \sum_{i=1}^{n} |b_i - a_i|^2 \right)}.$$
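
As a quick numerical illustration of how Lemma 4.1 is used in the proof of Theorem 2.1, the Python sketch below evaluates the right hand side of the lemma and confirms that, for $m = i + j$ indicator variables and $\epsilon = 0.5 n^{-\beta}$, it reduces to $2 e^{-0.5 n^{-2\beta} m}$. The function name `hoeffding_bound` and the sample values of `n`, `beta` and `m` are assumptions chosen here for illustration.

```python
import math


def hoeffding_bound(eps, ranges):
    """Right hand side of Lemma 4.1 for variables with a_i <= X_i <= b_i.

    ranges is a list of (a_i, b_i) pairs, one pair per variable.
    """
    count = len(ranges)
    mean_sq_width = sum((b - a) ** 2 for a, b in ranges) / count
    return 2.0 * math.exp(-2.0 * count * eps ** 2 / mean_sq_width)


# Indicator variables take values in [0, 1]; the numbers below are arbitrary.
n, beta, m = 1000, 0.2, 200
bound = hoeffding_bound(0.5 * n ** (-beta), [(0.0, 1.0)] * m)
closed_form = 2.0 * math.exp(-0.5 * n ** (-2 * beta) * m)
```

Here `bound` and `closed_form` agree up to floating point rounding, which is the exponent that appears in the estimate for a fixed pair $(i, j)$.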